add script that computes and prints the three different ways to make … #2
…ANI/distance/containment matrices
I see at least one of the issues -- by default As for the will continue looking...
as an aside, we might want to come back to the sourmash code to automatically enable
@bluegenes that does make sense with jaccard vs containment. Apologies if I overlooked something, but remind me when/where you asked for my thoughts on average containment?
Agreed!
Not written anywhere -- it was during our last meeting. I mentioned that I was using it for all my comparisons to mapping ANI values, and wanted to hear any caveats/issues you could think of around using it more widely or returning it directly from sourmash. The deeper question was how to handle confidence intervals (if at all) for average containment. Since I'm taking the average of the containment ANI values, I'm not generating confidence intervals for avg containment.
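To make "average of the containment ANI values" concrete, here is a minimal sketch. It assumes the simple point estimate ANI = C^(1/k) for a containment value C at k-mer size k; the function names are hypothetical and this is not a copy of sourmash's code. It produces a point estimate only, with no confidence interval, which matches the caveat above.

```python
def containment_to_ani(containment: float, ksize: int) -> float:
    """Point estimate of ANI from a containment index (assumed form: C ** (1/k))."""
    if not 0.0 <= containment <= 1.0:
        raise ValueError("containment must be in [0, 1]")
    if ksize < 1:
        raise ValueError("ksize must be a positive integer")
    return containment ** (1.0 / ksize)


def avg_containment_ani(c_ab: float, c_ba: float, ksize: int) -> float:
    """Mean of the two directional containment ANI estimates.

    c_ab is the containment of A in B, c_ba the containment of B in A.
    Each direction is converted to ANI first, then the two ANI values
    are averaged, so the result is symmetric in A and B.
    """
    ani_ab = containment_to_ani(c_ab, ksize)
    ani_ba = containment_to_ani(c_ba, ksize)
    return (ani_ab + ani_ba) / 2.0


if __name__ == "__main__":
    # Illustrative values only; not taken from the genomes discussed here.
    print(avg_containment_ani(0.5, 0.25, ksize=21))
```

Averaging after the conversion (rather than averaging the two containments and converting once) is a choice made to follow the wording above; the two orders give different numbers when the containments differ.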
With sourmash-bio/sourmash#2056, I enabled
The last major difference I see is that I didn't actually change our Using dkoslicki#1, the matrices look much more alike, though there are a couple cases with large discrepancies. I've tracked the big discrepancies down, and they're a result of One thing I could do that would not involve an issue with semantic versioning -- I could use the containment bias factor just during ANI estimation, and continue ignoring it for non-ANI containment for now. Testing in sourmash-bio/sourmash#2057.
mamba env for installations; use compare --avg-containment --ani
And @mahmudhera it might make sense to move the discussion here instead of Tessa's PR on my fork. That way the trail is easier to follow.
You are correct. To make it easier to track things: The issue has been solved partly by switching to the latest sourmash branch (after Tessa's update to use avg-containment in sourmash), which brought the sourmash ANI estimate and Tessa's ANI estimate into agreement. The minor disagreement between my code in this repository and sourmash compare (reported here by Tessa) was solved by using the same seed (previously, we were using different seeds). Currently, the script compare_matrices.py uses different seeds to compute ANI and prints the largest relative difference in percent, which is ~4.5%. I think we previously had > 50% relative difference in some cases because of (a) the Jaccard->ANI conversion and (b) hardcoded thresholds in sourmash (which are dropped for now).
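For reference, a "largest relative difference in percent" between two ANI matrices could be computed along these lines. This is a sketch under assumptions, not the code in compare_matrices.py: the function name is hypothetical, and normalising by the larger of the two magnitudes is one possible convention among several.

```python
def max_relative_difference_pct(a, b):
    """Largest element-wise relative difference between two matrices, in percent.

    a and b are equal-shaped lists of rows of floats. The denominator is the
    larger of the two absolute values (an assumed convention), so the result
    is symmetric in a and b and never exceeds 100 for non-negative entries.
    """
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("matrices must have the same shape")
    worst = 0.0
    for row_a, row_b in zip(a, b):
        for x, y in zip(row_a, row_b):
            denom = max(abs(x), abs(y))
            if denom == 0.0:
                continue  # both entries are zero, so they agree exactly
            worst = max(worst, 100.0 * abs(x - y) / denom)
    return worst


if __name__ == "__main__":
    # Illustrative matrices only; not results from this repository.
    first = [[1.0, 0.90], [0.90, 1.0]]
    second = [[1.0, 0.92], [0.92, 1.0]]
    print(max_relative_difference_pct(first, second))
```

With this convention, a pair where one method reports 0.0 and another reports a positive ANI counts as a 100% difference, which makes threshold effects easy to spot.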
Basically, we have Mahmudur's code, Tessa's code and Titus's code, which all purport to compute a matrix of ANI (or distance, easily converted) values. Unfortunately, as this script demonstrates, all three lead to different results.
For example, for one pair of genomes, these approaches gave respectively ANI values of: 0.6489, 0.0, and 0.3244.